Received: by cheltenham.cs.arizona.edu; Fri, 25 Nov 1994 07:42:26 MST
Original-Via:
Pp-Warning: Illegal Via field on preceding line
From: ROBERT VAN DER ZWAN <RZWAN@dish.gla.ac.uk>
To: icon-group@cs.arizona.edu
Date: Fri, 25 Nov 1994 14:30:48 GMT
Subject: textual analysis/tilt project glasgow uni.
Priority: normal
X-Mailer: PMail v3.0 (R1)
Message-Id: <49B163D1DE4@dish.gla.ac.uk>
Errors-To: icon-group-errors@cs.arizona.edu

Tilt C History
Checklist textual analysis 9/11/94

1. Textual database preparation/running

* a. Allowing import of extended-character-set ASCII text (including main
     European languages). Importing should be easy to handle.
* b. Recognition of simple mark-up, both for the structure of a text
     (chapters, pages etc.) and for elements of content. Mark-up of the
     structure of a text is essential for performing searches in parts of
     the text.
     - Mark-up preferably SGML because of possibilities of interchange.
     - Also of importance: possibilities of (semi-)automatic mark-up.
     - It should be possible to hide the mark-up.

2. Vocabulary overview (providing rough pointers to the nature and content
   of a text).

* a. Type-token ratio.
* b. Complete word list with frequency count, displayable both in
     alphabetical order and in order of frequency, for all or a
     predetermined part of the text. Also a selected word list (as opposed
     to complete).
  c. Token-character ratio (which should give a rough average of the
     lengths of words).

3. Content retrieval facilities.
   NB. As much as possible of a-e should be done in conjunction and should
   be subject to 'filtering' (treating only limited parts of the text).

* a. Word searches, including use of wild cards and Boolean operators.
* b. Combined search for user-defined clusters of semantically unrelated
     but near-synonymous words (noble, aristocr*).
* c. Search for word pairs (e.g. social contract) and proximate associates
     (e.g. mandatories of the people, rights of man/woman).
* d. Search for roots and lemmas (e.g. oligarchy, monarchy, noble for
     ennoblement, nobility). This could be done by the use of wild cards,
     but preferably by way of parsing(?).
* e. Collocation (including a user-defined span) producing a z-score,
     which indicates the measure of probability that words are used
     together on purpose.
  f. Machine-generated search strategy via thesaurus (preferably a
     user-trainable thesaurus, to accommodate variable historical usage),
     where potentially related words are offered from the thesaurus for
     confirmation or rejection by the searcher.

4. Additional quantitative/stylistic facilities (extending basics of 2
   above and currently achievable only through combination of various
   software)

* a. Enhancement of word frequency list (2b) by means of statistical
     options to calculate how much unique words, twice-occurring words,
     and so on up to high-frequency words contribute both to the total
     vocabulary and to the total word length (?). (Useful to assess the
     audience an author may consciously or unconsciously have wanted to
     address, and to refine the potentially misleading type-token ratio.)
* b. Graphical display of frequencies of unique words and so on.
  c. Direct quantification of word and sentence length (see 2c above).
     (Paragraph length is not meaningful for most historical texts and
     therefore not necessary.)
  d. Quantification of use of question marks, passive voice etc.
  e. Simple parsing to assist with 3d-f (allowing one to exclude e.g. all
     function words, or to search for nouns only, etc.).

5. Display functions:

* a. Keyword(s) displayed in full text (highlighted), and in concordance
     form (index, user-definable KWIC), giving location reference by line
     or marked-up section (chapter, page etc.) or both.
* b. 'Topographical' distribution display, showing clustering of
     keyword(s) over the entire text or user-specified sections of that
     text.
  c. Free movement between displays without the need for new retrieval.

6. User facilities:

* a. Simple interface for 'naive' users:
     - all functions available by menu and/or icon
     - preferably Windows compatible
     - step-by-step guidance through procedures
     - no use of difficult terms, or a good help function available.
* b. Easy output of results (to printer, word processor, database package
     or spreadsheet), preferably by using the cut-and-paste option in
     Windows.
* c. Reasonable speed of performance for complex retrievals (e.g.
     collocations) and large bodies of text (2-5 Mb).
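
To make items 2a-c, 3e and 5a of the checklist more concrete, here is a minimal Python sketch of what such facilities might compute. It is an illustration only, not part of the checklist and not a proposal for the eventual package. All function names are hypothetical. The tokenizer (lowercased runs of letters, so "author's" splits in two) is an assumption, and so is the z-score: the checklist does not say which formula is meant, so the sketch uses one commonly cited formulation (usually attributed to Berry-Rogghe), with the window size taken as `span` tokens on each side of the node word.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters (accented letters included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def vocabulary_overview(tokens):
    """Items 2a-2c: type-token ratio, both word lists, mean word length."""
    if not tokens:
        raise ValueError("no tokens to analyse")
    freq = Counter(tokens)
    return {
        "type_token_ratio": len(freq) / len(tokens),
        "mean_word_length": sum(map(len, tokens)) / len(tokens),
        "alphabetical": sorted(freq.items()),
        "by_frequency": sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])),
    }


def kwic(tokens, keyword, span=3):
    """Item 5a: one (position, left context, keyword, right context) tuple per hit."""
    return [
        (i, " ".join(tokens[max(0, i - span):i]), tok,
         " ".join(tokens[i + 1:i + 1 + span]))
        for i, tok in enumerate(tokens)
        if tok == keyword
    ]


def collocation_z(tokens, node, collocate, span=4):
    """Item 3e: z-score for `collocate` occurring within `span` tokens of `node`."""
    freq = Counter(tokens)
    f_node, f_coll = freq[node], freq[collocate]
    if node == collocate or f_node == 0 or f_coll == 0:
        raise ValueError("node and collocate must be distinct words in the text")
    if f_node + f_coll == len(tokens):
        raise ValueError("text contains only these two words; z-score undefined")
    # Observed co-occurrences: collocate tokens inside the window around each
    # node occurrence (overlapping windows are counted once per node).
    observed = sum(
        tokens[max(0, i - span):i].count(collocate)
        + tokens[i + 1:i + 1 + span].count(collocate)
        for i, tok in enumerate(tokens)
        if tok == node
    )
    p = f_coll / (len(tokens) - f_node)  # chance a non-node token is the collocate
    expected = p * f_node * 2 * span     # assumed window: span tokens on each side
    return (observed - expected) / math.sqrt(expected * (1 - p))


if __name__ == "__main__":
    sample = tokenize("The noble king and the noble queen met the people")
    overview = vocabulary_overview(sample)
    print(overview["type_token_ratio"], overview["mean_word_length"])
    print(kwic(sample, "noble", span=2))
    print(collocation_z(sample, "noble", "the", span=1))
```

The 'filtering' asked for under 3 would amount to passing only the tokens of the selected chapters or pages to these functions. A real tool would also need the mark-up handling of 1b to know where those sections begin and end.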